Day 2 Session 1:
Dynamic spectral analysis
using Functional Principal Component Analysis (FPCA)

Introduction

Principal Component Analysis (PCA)

Let’s apply Principal Component Analysis (PCA) to the static data.

Preliminaries

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.static.csv" from the "data" directory and save it as "df_mid"
df_mid <- readr::read_csv("data/initial.liquid.static.csv")

Checking data and data wrangling

As usual, let’s check what variables are included in the data set using colnames().

# Let's check what columns the data frame contains
colnames(df_mid)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "duration"       "segment"        "word"           "f1"            
##  [9] "f2"             "f3"             "previous_sound" "next_sound"    
## [13] "percent"        "IsApproximant"  "IsAcoustic"     "gender"        
## [17] "omit"           "position"

Let’s also remove irrelevant columns and recode the next_sound labels into a vowel column with IPA symbols, as has been done previously.

# Remove columns 
df_mid <- df_mid |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit))

# Recode the next_sound labels into IPA vowel symbols
df_mid <- df_mid |> 
  dplyr::mutate(
    vowel = dplyr::case_when(
      next_sound == "AE1" ~ "/æ/",
      next_sound == "IY1" ~ "/i/",
      next_sound == "UW1" ~ "/u/"
    )
  )

# z-normalise formant values for cross-speaker comparison
df_mid <- df_mid |> 
  dplyr::group_by(speaker) |> 
  dplyr::mutate(
    f1z = as.numeric(scale(f1)), # scale() returns a one-column matrix; as.numeric() keeps a plain vector
    f2z = as.numeric(scale(f2)),
    f3z = as.numeric(scale(f3))
  ) |> 
  dplyr::ungroup()

Your turn

Please write code below to inspect the data. Try obtaining some descriptive statistics (e.g., the number of tokens, participants, etc.).

# Check column names again
# ...
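If you get stuck, here is one possible answer, assuming the column names shown above (speaker, language, vowel):

```r
# number of participants per language
df_mid |>
  dplyr::group_by(language) |>
  dplyr::summarise(n = dplyr::n_distinct(speaker)) |>
  dplyr::ungroup()

# number of tokens per language and vowel
df_mid |>
  dplyr::count(language, vowel)
```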

Data visualisation

Let’s try some data visualisation. First, we’ll compare between-group differences using a combination of scatter, box and violin plots. In order to plot everything together, we’ll convert the data set into long format using the tidyr::pivot_longer() function.

# convert the data set into long format
df_mid_long <- df_mid |> 
  dplyr::select(-c(f1, f2, f3)) |> 
  tidyr::pivot_longer(c(f1z, f2z, f3z), names_to = "formant", values_to = "values")
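Note that the plotting code below assumes a colour palette object cbPalette is already defined (e.g., earlier in the course materials). If it is not in your workspace, a common choice is the Okabe-Ito colour-blind-friendly palette:

```r
# Okabe-Ito colour-blind-friendly palette (8 colours)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
```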

# plot
df_mid_long |> 
  ggplot(aes(x = language, y = values)) +
  geom_jitter(aes(colour = language), alpha = 0.5) +
  geom_violin(alpha = 0.4) +
  geom_boxplot(width = 0.3, alpha = 0.4) +
  facet_grid(vowel ~ formant) +
  scale_colour_manual(values = cbPalette) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

Let’s also take a slightly different approach. The objective of PCA is to draw straight lines along the dimension that shows the greatest variance. To inspect the variance in the data, we’ll plot the relationships between the three spectral dimensions.

# F1 and F2
df_mid |> 
  ggplot(aes(x = f1z, y = f2z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ segment)

# F2 and F3
df_mid |> 
  ggplot(aes(x = f2z, y = f3z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ segment)

# F1 and F3
df_mid |> 
  ggplot(aes(x = f1z, y = f3z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ segment)

As expected, English /l/ and /ɹ/ can be reliably distinguished along the F3 dimension, such that English /ɹ/ shows lower F3 than English /l/.

The key takeaway here, however, is that we always need multiple plots in order to obtain a holistic understanding of the data structure, which is somewhat cumbersome. Here, PCA comes in handy, as it is a dimension reduction technique and boils down the variance in the data into a handful of principal components.

Applying PCA

Let’s apply PCA to the formant data, classifying the tokens based on the z-scored F1, F2 and F3. We use the prcomp() function to perform PCA.

# PCA on F1z, F2z and F3z
pca_mid <- prcomp(dplyr::select(df_mid, f1z, f2z, f3z))

# check summary
summary(pca_mid)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.1297 0.9955 0.8218
## Proportion of Variance 0.4337 0.3368 0.2295
## Cumulative Proportion  0.4337 0.7705 1.0000
pca_var <- pca_mid$sdev^2
pca_var_exp <- pca_var / sum(pca_var)  # Proportion of variance

The summary shows three rows:

  • Standard deviation: This is the standard deviation of each PC; its square is the variance captured by that component. The greater the value, the more variance is captured by the PC.

  • Proportion of Variance: This expresses the amount of variance explained by each PC in a proportional manner. This can be calculated with the following formula:

\[ pca\_var\_exp = \frac{(\text{sdev})^2}{\sum pca\_var} \]

  • Cumulative Proportion: This simply shows the sum of the proportion of variance explained by each PC. For instance, PC1 explains 43.37% of variance whereas PC2 33.68%. The cumulative proportion of PC1 and PC2 is thus 77.05% (i.e., 43.37% + 33.68%).
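The three rows of the summary can be reproduced directly from the sdev slot of the prcomp object:

```r
# reproduce the summary() rows from the standard deviations
pca_var <- pca_mid$sdev^2              # variance per PC
pca_var_exp <- pca_var / sum(pca_var)  # Proportion of Variance
cumsum(pca_var_exp)                    # Cumulative Proportion
```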

PCA summary

Let’s also look inside pca_mid.

# show the PCA results
pca_mid
## Standard deviations (1, .., p=3):
## [1] 1.1297301 0.9955437 0.8217881
## 
## Rotation (n x k) = (3 x 3):
##            PC1         PC2        PC3
## f1z -0.1095097  0.97205301  0.2076550
## f2z -0.6942959 -0.22430409  0.6838428
## f3z -0.7113093  0.06928656 -0.6994559

The top line shows the standard deviations, which we just saw earlier, so let’s skip this for now.

At the bottom, we have the loadings matrix. This shows the eigenvectors: (1) the amount and (2) the direction of the contribution that each original data dimension makes to each PC. Let’s disentangle this one by one:

  • PC1: Contributions of f1z, f2z and f3z are all in the same direction (i.e., negative). But the amount of contribution is fairly small for f1z compared to f2z and f3z.

  • PC2: In contrast to PC1, f1z contributes to PC2 to a greater extent than f2z and f3z.

  • PC3: The amount of contribution itself is quite similar to PC1, in that f2z and f3z make a substantial contribution. But the directionality is opposite between f2z and f3z.

So, all in all, PCA might suggest:

  • PC1 captures a covariation between f2z and f3z, which may correspond to the spectral difference between /l/ and /ɹ/.

  • PC2 mainly captures the variation of f1z.

  • PC3 captures an inverse relationship between f2z and f3z, which might correspond to differences in coarticulatory effects.
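As a sanity check on this interpretation, the PC scores stored in pca_mid$x are simply the mean-centred data multiplied by the loadings (rotation) matrix, which is what "projecting onto the eigenvectors" means in practice:

```r
# project the centred data onto the eigenvectors; this reproduces pca_mid$x
X <- scale(cbind(df_mid$f1z, df_mid$f2z, df_mid$f3z), scale = FALSE)
all.equal(unname(X %*% pca_mid$rotation), unname(pca_mid$x))  # should be TRUE
```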

The interpretation may be better facilitated by data visualisation. Let’s try two main approaches to data visualisation of PCA.

Data visualisation 1: Scree plot

The amount of variance explained is useful when determining which PCs to retain for analysis. As a rule of thumb, Baayen (2008) recommends retaining PCs that explain more than 5% of the variance in the data.

A useful visualisation of the proportion of variance is the scree plot. The base function screeplot() lets you create one quite easily:

# scree plot: bar chart
screeplot(pca_mid)

# scree plot: line plot
screeplot(pca_mid, type = "line")

You can also create a scree plot using ggplot() by manually calculating the proportion of variance based on the standard deviation (see above for the formula). This is useful when you need more flexibility – e.g., to show the 5% threshold suggested by Baayen (2008).

## obtaining proportion of variance 
pca_var_exp <- pca_mid$sdev^2 / sum(pca_mid$sdev^2)

# convert to a tibble and add a PC index column
pca_var_exp <- pca_var_exp |>  
  tibble::as_tibble() |>  
  dplyr::mutate(
    PC = dplyr::row_number()
  )

# create a plot
scree_plot <- pca_var_exp |> 
  ggplot(aes(x = PC, y = value)) +
  geom_line() +
  geom_text(aes(label = round(value, digits = 3)*100), nudge_x = 0.2) +
  geom_label(aes(label = PC), label.padding = unit(0.40, "lines")) +
  geom_hline(yintercept = 0.05, linetype = 'dotted') +
  scale_x_continuous(breaks = c(1, 2, 3)) +
  labs(x = "Principal Component", y = "Variance Explained", title = "Proportion of Variance explained by each PC")

# showing plot
scree_plot

Since only three PCs were identified from the data, the scree plot is not particularly informative here. But it will be useful when prcomp() identifies more than a few PCs.

Data visualisation 2: Biplot

The interpretation of the loadings matrix was more or less straightforward in this case, given that only three parameters were involved. It is easy to see, however, that this can get quite complicated when the original data have a greater number of dimensions.

Biplots help us better understand the loadings and PC scores for each observation. For example, the default function biplot() plots all observations along the PC1 and PC2 dimensions with eigenvectors shown in red arrows (i.e., the direction and the amount of weighting: also called loadings).

# default function for biplot
biplot(pca_mid)

Similarly to the scree plot, we could also create biplots manually using ggplot().

# Extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)

# Extract PCA loadings (the contribution of each original variable to the PCs)
pca_loadings <- as.data.frame(pca_mid$rotation)

# Combine scores and loadings into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel")])

# Reshape the loadings for visualization
loadings_df <- pca_loadings |> 
  mutate(variable = rownames(pca_loadings))

# Plotting the PCA biplot with facets by language
ggplot() +
  geom_point(data = scores_df, aes(x = PC1, y = PC2, color = vowel, shape = vowel), size = 2, alpha = 0.5) + # Plot the PCA scores (observations)
  geom_segment(data = loadings_df, aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "red") + # Plot the loadings (variables)
  geom_text(data = loadings_df, aes(x = PC1 * 5, y = PC2 * 5, label = variable), size = 4, color = "red") + # Add labels for the loadings
  stat_ellipse(data = scores_df, aes(x = PC1, y = PC2, color = vowel, fill = vowel), geom = "polygon", alpha = 0.2, level = 0.95) + 
  facet_grid(segment ~ language) +
  labs(title = "PCA biplot by language") +
  scale_color_manual(values = cbPalette) + 
  scale_fill_manual(values = cbPalette) + 
  theme(legend.position = "bottom",
        strip.text.y = element_text(angle = 360))

Functional Principal Component Analysis (FPCA)

Preliminaries

Installing/loading packages

# installing packages
# install.packages("tidyverse")
# install.packages("fdapace")

# importing packages
library(tidyverse)
library(fdapace)

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.dynamic.csv" from the "data" directory and save it as "df_dyn"
df_dyn <- readr::read_csv("data/initial.liquid.dynamic.csv")

Data wrangling

Check data

We always start with inspecting the data set using colnames().

# Let's check what columns the data frame contains
colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "IsApproximant"  "IsAcoustic"    
## [17] "note"           "gender"         "omit"           "Barkf1"        
## [21] "Barkf2"         "Barkf3"         "f2f1"           "f3f2"          
## [25] "Barkf2f1"       "Barkf3f2"       "position"       "context"       
## [29] "liquid"         "input_file"

Omitting irrelevant columns

We’ll omit the columns we don’t need.

# Let's check the values in the "IsApproximant" column
df_dyn |> 
  dplyr::group_by(IsApproximant) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsApproximant
##   <chr>        
## 1 yes
# Let's check the values in the "IsAcoustic" column
df_dyn |> 
  dplyr::group_by(IsAcoustic) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsAcoustic
##   <chr>     
## 1 yes
# How about 'omit'?
df_dyn |> 
  dplyr::group_by(omit) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##    omit
##   <dbl>
## 1     0
# Remove columns that we no longer need
df_dyn <- df_dyn |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit, Barkf1, Barkf2, Barkf3, Barkf2f1, Barkf3f2, f2f1, f3f2))

Let’s check the column names again.

colnames(df_dyn)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "f0"             "duration"       "segment"        "previous_sound"
##  [9] "next_sound"     "word"           "f1"             "f2"            
## [13] "f3"             "time"           "note"           "gender"        
## [17] "position"       "context"        "liquid"         "input_file"

Let’s also convert the context column into IPA symbols for a more intuitive representation:

# convert the ARPABET notation into IPA symbols
df_dyn <- df_dyn |> 
  dplyr::mutate(
    context = case_when(
      context == "AE" ~ "/æ/",
      context == "IY" ~ "/i/",
      context == "UW" ~ "/u/"
    )
  )

Checking the number of participants, tokens…

Let’s also obtain some descriptive statistics here. Note that we need to divide the number of rows by 11 to obtain the accurate number of tokens, as one token now has 11 time points.

# number of participants
df_dyn |> 
  dplyr::group_by(language) |> 
  dplyr::summarise(n = n_distinct(speaker)) |> 
  dplyr::ungroup()
## # A tibble: 2 × 2
##   language     n
##   <chr>    <int>
## 1 English     14
## 2 Japanese    31
# number of tokens per segment
df_dyn |> 
  dplyr::group_by(segment) |> 
  dplyr::summarise(n = n()/11) |> # divide by 11 time points
  dplyr::ungroup()
## # A tibble: 6 × 2
##   segment     n
##   <chr>   <dbl>
## 1 LAE1      511
## 2 LIY1      408
## 3 LUW1      299
## 4 RAE1      502
## 5 RIY1      481
## 6 RUW1      314

Data visualisation

Scaling formant frequencies

Do you remember how to visualise the dynamic data? The basic procedure is the same as in the static analysis: we first apply z-score normalisation to the formant frequencies to make sure that formant values are comparable across speakers.

df_dyn <- df_dyn |> 
  dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
  dplyr::mutate(
    f1z = as.numeric(scale(f1)), # scale f1 into z-score
    f2z = as.numeric(scale(f2)), # scale f2 into z-score
    f3z = as.numeric(scale(f3)) # scale f3 into z-score
  ) |> 
  dplyr::ungroup() # don't forget ungrouping

Descriptive statistics

Let’s check the mean and SD for both raw and normalised formant values; just look at F1 for now. Note that the mean z-scores are not exactly zero, but values like -9.05e-17 are effectively zero: the tiny deviations arise from floating-point rounding error in how computers represent decimal numbers.

# check mean and sd of raw/scaled F1 values for each speaker
df_dyn |> 
  dplyr::group_by(speaker) |>
  dplyr::summarise(
    f1_mean = mean(f1),
    f1_sd = sd(f1),
    f1z_mean = mean(f1z),
    f1z_sd = sd(f1z)
  ) |> 
  dplyr::ungroup() 
## # A tibble: 45 × 5
##    speaker f1_mean f1_sd  f1z_mean f1z_sd
##    <chr>     <dbl> <dbl>     <dbl>  <dbl>
##  1 2d57ke     468.  140. -9.05e-17   1   
##  2 2drb3c     575.  243.  2.39e-16   1   
##  3 2zy9tf     459.  228. -2.41e-16   1   
##  4 3bcpyh     438.  142. -5.72e-18   1.00
##  5 3hsubn     537.  177.  6.11e-17   1   
##  6 3pzrts     458.  195. -1.44e-18   1.00
##  7 4ps8zx     467.  172.  2.19e-16   1   
##  8 54i2ks     453.  192.  1.43e-16   1.00
##  9 5jzj2h     412.  133.  2.49e-16   1.00
## 10 5upwe3     444.  189. -6.51e-17   1.00
## # ℹ 35 more rows
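The e-16/e-17 values above are exactly this rounding error. A quick illustration with made-up numbers:

```r
x <- c(412.3, 537.1, 468.9, 575.4)  # hypothetical F1 values
z <- as.numeric(scale(x))
mean(z)               # either exactly 0 or a tiny rounding error
abs(mean(z)) < 1e-12  # TRUE: effectively zero
```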

Visualisation

Raw trajectories

Let’s visualise the raw data first:

# F2 - raw trajectories
df_dyn |> 
  ggplot(aes(x = time, y = f2z)) +
  geom_point(aes(colour = language, group = file), alpha = 0.4) +
  geom_path(aes(colour = language, group = file), alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F2 (z-normalised)", title = "time-varying change in F2 frequency") +
  theme(strip.text.y = element_text(angle = 0))

Smooths

Let’s also try plotting smoothed trajectories:

# F2 - smooths
df_dyn |> 
  ggplot(aes(x = time, y = f2z)) +
  # geom_point(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  # geom_path(aes(colour = language, group = file), width = 0.3, alpha = 0.1) +
  geom_smooth(aes(colour = language, group = language), linewidth = 1.2, se = TRUE) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  scale_colour_manual(values = cbPalette) + 
  facet_grid(liquid ~ context) +
  labs(x = "time", y = "F2 (z-normalised)", title = "smoothed time-varying change in F2 frequency") +
  theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Functional Principal Component Analysis (FPCA)

So far, we have tried fitting Generalised Additive Mixed-Effects Models (GAMMs) to investigate between-group differences in F2 dynamics. This was possible because we had some idea of the relevant factors (e.g., vowel context, speaker group). In a way, lots of decisions were made in a top-down manner.

We could also try the opposite, bottom-up approach. That is, do the data tell us anything about higher-order groupings? Do the vowel context/speaker group differences emerge from the data themselves?

Here, let’s try Functional Principal Component Analysis (FPCA) as a bottom-up approach.

# IDs = token column; tVec = time column; yVec = variable column(s)
input_df <- fdapace::MakeFPCAInputs(IDs = df_dyn$file, tVec = df_dyn$time, yVec = df_dyn$f2z)

# Check if there's any issues with the data
fdapace::CheckData(input_df$Ly, input_df$Lt)

# No errors have been returned, so let's now run FPCA on the dynamic F2 trajectories
df_dyn_fpca <- fdapace::FPCA(Ly = input_df$Ly, input_df$Lt, optns = list(plot = TRUE))

Checking FPCA results

# eigenvalues
df_dyn_fpca$lambda
## [1] 73.431312 13.589246  4.373711  1.285172
# the cumulative percentage of variance explained by the eigenvalue
df_dyn_fpca$cumFVE
## [1] 0.7859692 0.9314212 0.9782350 0.9919908
# list of sampling time
df_dyn_fpca$workGrid
##  [1]   0  10  20  30  40  50  60  70  80  90 100
# PC scores -> each row is 1 token, each column is one PC
head(df_dyn_fpca$xiEst)
##            [,1]     [,2]       [,3]        [,4]
## [1,]  1.3978292 5.260927  2.1324070 -2.52256410
## [2,] -1.5626684 5.595501  0.2863422 -1.84886236
## [3,] -0.6166366 4.291694  1.1572724 -1.35058648
## [4,] -5.5315817 2.679823  0.1415474  0.08740263
## [5,]  0.5428989 3.542994  3.8331163 -2.47430535
## [6,] -2.7433146 7.071773 -2.1509403 -1.88748022
# Mean curve
fdapace::GetMeanCurve(Ly = input_df$Ly, Lt = input_df$Lt, optns = list(plot = TRUE))

## $mu
##  [1] -0.70831425 -0.56725924 -0.32739985 -0.04873545  0.18428169  0.31887341
##  [7]  0.37257545  0.37071962  0.31025622  0.17889910 -0.08389668
## 
## $workGrid
##  [1]   0  10  20  30  40  50  60  70  80  90 100
## 
## $bwMu
## NULL
## 
## $optns
## $optns$userBwMu
## [1] 5
## 
## $optns$methodBwMu
## [1] "Default"
## 
## $optns$userBwCov
## [1] 10
## 
## $optns$methodBwCov
## [1] "Default"
## 
## $optns$kFoldMuCov
## [1] 10
## 
## $optns$methodSelectK
## [1] "FVE"
## 
## $optns$FVEthreshold
## [1] 0.99
## 
## $optns$FVEfittedCov
## NULL
## 
## $optns$fitEigenValues
## [1] FALSE
## 
## $optns$maxK
## [1] 9
## 
## $optns$dataType
## [1] "Dense"
## 
## $optns$error
## [1] TRUE
## 
## $optns$shrink
## [1] FALSE
## 
## $optns$nRegGrid
## [1] 11
## 
## $optns$rotationCut
## [1] 0.25 0.75
## 
## $optns$methodXi
## [1] "IN"
## 
## $optns$kernel
## [1] "epan"
## 
## $optns$lean
## [1] FALSE
## 
## $optns$diagnosticsPlot
## NULL
## 
## $optns$plot
## [1] TRUE
## 
## $optns$numBins
## NULL
## 
## $optns$useBinnedCov
## [1] TRUE
## 
## $optns$usergrid
## [1] FALSE
## 
## $optns$yname
## [1] "Ly"
## 
## $optns$methodRho
## [1] "vanilla"
## 
## $optns$verbose
## [1] FALSE
## 
## $optns$userMu
## NULL
## 
## $optns$userCov
## NULL
## 
## $optns$methodMuCovEst
## [1] "cross-sectional"
## 
## $optns$userRho
## NULL
## 
## $optns$userSigma2
## NULL
## 
## $optns$outPercent
## [1] 0 1
## 
## $optns$useBinnedData
## [1] "AUTO"
## 
## $optns$useBW1SE
## [1] FALSE
## 
## $optns$imputeScores
## [1] TRUE
# plot
plot(df_dyn_fpca)

# scree plot
CreateScreePlot(df_dyn_fpca)

# path plot
CreatePathPlot(df_dyn_fpca, xlab = "normalised time", ylab = "F2 (z-normalised)")

# function: get PC scores + return data frame with PCs for each token
get_pc_scores <- function(fpcaObj){
  pcs <- data.frame(fpcaObj$xiEst)
  token <- names(fpcaObj$inputData$Lt) 
  df <- cbind(token, pcs)
  n_pcs <- length(fpcaObj$lambda) # get number of PCs
  pc_names <- paste0("PC", 1:n_pcs) # create colnames for PCs
  names(df) <- c("file", pc_names) # add colnames for token + PCs
  return(df)
}

# get PC scores w/ token info
pc_df <- get_pc_scores(df_dyn_fpca)

# join PC scores (pc_df) with selected cols from the original data frame 
## store meta info
meta <- df_dyn |>  
  select(speaker, gender, language, word, liquid, context, file)

## merge the PC scores and meta data - unique(meta) because otherwise there would be lots of duplicates (one row per time point)
dat <- left_join(pc_df, unique(meta), by = "file")